read data from an hdf rather than a csv #29
    column.loc[to_noise_idx], configuration, randomness_stream, additional_key
)

column.loc[to_noise_idx] = noised_data
Was this actually causing a problem or do you just find this more readable?
This just made debugging easier since I could put a breakpoint between the function call and the assignment to the series.
src/pseudopeople/interface.py
Outdated
data = pd.read_csv(path, dtype=str, keep_default_na=False)
data = pd.read_hdf(path)
if not isinstance(data, pd.DataFrame):
    raise TypeError(f"File located at {path} must contain a pandas DataFrame.")
Consider moving this into a load_data utility function so that we don't forget to check the type.
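A minimal sketch of what that suggestion could look like. The name load_data comes from the comment above; the signature, type hints, and docstring are assumptions made for illustration, not the project's actual code.

```python
from pathlib import Path
from typing import Union

import pandas as pd


def load_data(path: Union[str, Path]) -> pd.DataFrame:
    """Read an HDF file and verify that it contains a DataFrame.

    Hypothetical helper: the signature is an assumption for illustration.
    """
    # pd.read_hdf can return other pandas objects (for example a Series),
    # so the type check lives here instead of at every call site.
    data = pd.read_hdf(path)
    if not isinstance(data, pd.DataFrame):
        raise TypeError(f"File located at {path} must contain a pandas DataFrame.")
    return data
```

Callers would then use load_data(path) wherever they currently call pd.read_hdf directly, so the check cannot be skipped by accident.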
@@ -303,6 +303,7 @@ def keyboard_corrupt(truth, corrupted_pr, addl_pr, rng):
include_original_token_level = configuration.include_original_token_level

rng = np.random.default_rng(seed=randomness_stream.seed)
column = column.astype(str)
This will convert any NaNs to "nan" and then proceed to corrupt that string. We shouldn't have any NaNs at this point though, right? Because those get dropped up front before this gets called?
Yes, that's correct, but definitely great to call this out.
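To illustrate the ordering discussed in this thread: in pandas releases before 3.0, casting a column to str turns a missing value into the literal string "nan", while newer releases may preserve the missing value instead. Dropping missing values before the cast avoids the question entirely. The helper below is a hypothetical sketch and is not part of pseudopeople.

```python
import numpy as np
import pandas as pd


def cast_present_values_to_str(column: pd.Series) -> pd.Series:
    """Drop missing values, then cast the remainder to str.

    Hypothetical helper for illustration; not part of pseudopeople.
    """
    # Dropping first means no NaN can be stringified and then noised.
    return column.dropna().astype(str)


# Made-up sample column with one missing entry.
column = pd.Series([12345, np.nan, 67890], dtype="object")
present = cast_present_values_to_str(column)
print(present.tolist())
```

The result keeps the original index labels of the surviving rows, so noised values can still be assigned back with column.loc[...].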
Read data from HDF rather than CSV
Description
Read data in from HDF rather than CSV
Fix errors in incorrect_select_options.csv
Fix issue with categorical dtypes by using NA instead of "" for missing data
Cast column to str dtype for typographic errors.
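The description does not say what the categorical issue was. As general background, the sketch below shows how pandas treats "" and NA differently in a categorical column; the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample data: the same column with "" versus NaN for the missing entry.
with_empty = pd.Series(["a", "", "b"], dtype="category")
with_na = pd.Series(["a", np.nan, "b"], dtype="category")

# "" is stored as a real category, so nothing is flagged as missing.
print(list(with_empty.cat.categories), int(with_empty.isna().sum()))

# NaN is never a category; the entry is simply marked as missing.
print(list(with_na.cat.categories), int(with_na.isna().sum()))
```

With NA, helpers such as isna() and dropna() can identify the missing rows, which matches the expectation that missing values are dropped before noising.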
Testing
Ran integration tests against sample data generated by the updated simulation, which outputs HDF files.
Ran the automated test suite.